Population and Job forecasting
Authored by: Rhutuvaruni Kharade and Tharusha Chao
Duration: 100 mins
Level: Intermediate
Pre-requisite Skills: Python, PowerBI, Tableau
Scenario

As a citizen and a job seeker, I want to find a job in the field that currently has a lot of demand in this area. Job seekers are often worried about which jobs are in high demand or in low demand? Which industry will have higher number of jobs in future ? Which areas have highest numbers of jobs based on a particular industry ? Its important for them to know which jobs they should study for and which city has highest concentration of such jobs so they can plan their living and stay in such areas. By understanding the number of jobs based on a cities, job seekers can ensure that they can get a job or live in close vicinity of that job if they live in a particular areas.

As a business owner I want to establish my business where there is a high demand for my service. I want to make sure that the city I am planning to establish my business in has a higher population so I can attract more customers and potential candidates to work for my company. Business owners need to know in which city they should establish their company in. This is based on the population of the area. Number of people living in a particular area can motivate business owners to have their business in such locations. This will also attract many job seekers in these places which will help the businesses find the right candidates (employees) to work for them.

What this use case will teach you

At the end of this use case you will -

  • Learn how to make open the data in the form of pandas dataframe and save it for futher use.
  • Have learned how to clean, transform, analyze, visualize data and report outcomes (findings) effectively.
  • Use the data to create effective visualization like scatterplots, heatmaps, histograms, etc to aid an understanding of the data to technical and non-technical readers
  • using dashboards for effective story telling to both techical and non-technical audiences.
  • Using version control tools to collaborate and contribute to the project//
Introduction

This project focuses on finding the relationship between jobs and population. Jobs in different industry impact the population in the area. This project will look at this impact and will also give reader a brief idea about how population changes with respect to the number of jobs in a particular area. The data is sourced from City of Melbourne Open Data which is open source and will be used throughout this project. The dataset will be cleaned, transformed, analyzed, visualized and finally relevant insights will be reported and documented. These findings will help stakeholders, policy makers and other readers for futher decision making.

Datasets Used

1 : city-of-melbourne-jobs-forecasts-by-small-area-2020-2040

2 : city-of-melbourne-population-forecasts-by-small-area-2020-2040

Importing libraries
In [3150]:
# importing libraries 
import pandas as pd 
import seaborn as sns 
import numpy as np 
import requests
import os 
import matplotlib.pyplot as plt 
#import missingno as msno
Connecting to Dataset and Testing
In [3151]:
#job = pd.read_csv("city-of-melbourne-jobs-forecasts-by-small-area-2020-2040.csv")
job_data = pd.read_csv("https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/city-of-melbourne-jobs-forecasts-by-small-area-2020-2040/exports/csv?lang=en&timezone=Australia%2FSydney&use_labels=true&delimiter=%2C")
job = job_data.copy()
In [3152]:
job.head()
Out[3152]:
Geography Year Category Industry Space Use Value
0 City of Melbourne 2023 Jobs by industry Accommodation 10286
1 City of Melbourne 2026 Jobs by industry Accommodation 11631
2 City of Melbourne 2032 Jobs by industry Accommodation 13207
3 City of Melbourne 2034 Jobs by industry Accommodation 13420
4 City of Melbourne 2035 Jobs by industry Accommodation 13529
In [3153]:
# to get a basic understanding of variables and null values
job.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9114 entries, 0 to 9113
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Geography           9114 non-null   object
 1   Year                9114 non-null   int64 
 2   Category            9114 non-null   object
 3   Industry Space Use  9114 non-null   object
 4   Value               9114 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 356.1+ KB
In [3154]:
# To see total numbers of rows and columns 
job.shape
Out[3154]:
(9114, 5)
In [3155]:
# Renaming columns for ease, making them all lowercase. 
job = job.rename(columns = {"Geography":"city", "Year":"year", "Category":"category", "Industry Space Use": "industry", "Value": "value"} )
In [3156]:
# In geography columns, to see which unique location it consists of
job.city.unique()
Out[3156]:
array(['City of Melbourne', 'Carlton', 'Docklands', 'East Melbourne',
       'Kensington', 'Melbourne (CBD)', 'Melbourne (Remainder)',
       'North Melbourne', 'Parkville', 'Port Melbourne', 'South Yarra',
       'Southbank', 'West Melbourne (Industrial)',
       'West Melbourne (Residential)'], dtype=object)
In [3157]:
# to check unique years this data is from
job.year.unique()
Out[3157]:
array([2023, 2026, 2032, 2034, 2035, 2021, 2025, 2037, 2040, 2027, 2028,
       2036, 2041, 2029, 2031, 2038, 2024, 2030, 2039, 2022, 2033],
      dtype=int64)
In [3158]:
# to understand uniqe categories in category columns 
# job.category.unique()
job
Out[3158]:
city year category industry value
0 City of Melbourne 2023 Jobs by industry Accommodation 10286
1 City of Melbourne 2026 Jobs by industry Accommodation 11631
2 City of Melbourne 2032 Jobs by industry Accommodation 13207
3 City of Melbourne 2034 Jobs by industry Accommodation 13420
4 City of Melbourne 2035 Jobs by industry Accommodation 13529
... ... ... ... ... ...
9109 West Melbourne (Residential) 2025 Jobs by space use Total jobs 5454
9110 West Melbourne (Residential) 2026 Jobs by space use Total jobs 5618
9111 West Melbourne (Residential) 2029 Jobs by space use Total jobs 6118
9112 West Melbourne (Residential) 2033 Jobs by space use Total jobs 6717
9113 West Melbourne (Residential) 2036 Jobs by space use Total jobs 7027

9114 rows × 5 columns

In [3159]:
job.industry.unique()
Out[3159]:
array(['Accommodation', 'Admin and support services',
       'Agriculture and mining', 'Arts and recreation services',
       'Business services', 'Construction', 'Education and training',
       'Other services', 'Electricity, gas, water and waste services',
       'Finance and insurance', 'Food and beverage services',
       'Health care and social assistance',
       'Information media and telecommunications', 'Manufacturing',
       'Public administration and safety', 'Real estate services',
       'Rental and hiring services', 'Retail trade',
       'Transport, postal and storage', 'Wholesale trade',
       'Accommodation - Commercial', 'Education',
       'Entertainment - Hospitality', 'Entertainment - Other', 'Health',
       'Industrial', 'Office', 'Other - Employment', 'Retail',
       'Total jobs'], dtype=object)
About Jobs Forecast Dataset.

This dataset provides jobs forecasts by single year for 2021 to 2041. Prepared by SGS Economics and Planning (Jan-Jun 2022), forecasts are available for the municipality and small areas, as well as by industry and space use type. This dataset has total of 9114 rows which indicates total cases and 5 columns which indicate variables(features). The dataset contains the following variables (features):

  • city: Geographical area (Melbourne LGA or small areas used for the City of Melbourne's CLUE analysis). Small areas mostly correspond to traditional suburb boundaries. This is a categorical variable of type object. This variable takes value from the following list of locations: 'City of Melbourne', 'Carlton', 'Docklands', 'East Melbourne','Kensington', 'Melbourne (CBD)', 'Melbourne (Remainder)','North Melbourne', 'Parkville', 'Port Melbourne', 'South Yarra','Southbank', 'West Melbourne (Industrial)','West Melbourne (Residential)'

  • year: This variables shows the year the job were created. This is a numerical variable of type object. Year is from 2022 to 2040.

  • category: This variables shows the category of the job that was created in that particular year. This is a numerical variable of type object. There are two categories in this variables however we will only look at category by industry.

  • industry_space_use: This variables shows the space used by a particular industry that was established in given year. This is a numerical variable of type object

  • value: This variables shows the number of jobs created by a particular industry that was established in given year. This is a numerical variable of type integer

In [3160]:
#pop = pd.read_csv("city-of-melbourne-population-forecasts-by-small-area-2020-2040.csv")
pop = pd.read_csv("https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/city-of-melbourne-population-forecasts-by-small-area-2020-2040/exports/csv?lang=en&timezone=Australia%2FSydney&use_labels=true&delimiter=%2C")
# pop          
In [3161]:
pop.head()
Out[3161]:
Geography Year Gender Age Value
0 City of Melbourne 2021 Female Age 0-4 1943.0
1 City of Melbourne 2023 Female Age 0-4 2477.0
2 City of Melbourne 2026 Female Age 0-4 3595.0
3 City of Melbourne 2028 Female Age 0-4 4083.0
4 City of Melbourne 2034 Female Age 0-4 5242.0
In [3162]:
pop.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17052 entries, 0 to 17051
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Geography  17052 non-null  object 
 1   Year       17052 non-null  int64  
 2   Gender     17052 non-null  object 
 3   Age        17052 non-null  object 
 4   Value      16989 non-null  float64
dtypes: float64(1), int64(1), object(3)
memory usage: 666.2+ KB
In [3163]:
pop.shape
Out[3163]:
(17052, 5)
In [3164]:
# Renaming columns for ease, making them all lowercase. 
pop = pop.rename(columns = {"Geography":"city", "Year":"year", "Gender":"gender", "Age": "age", "Value": "value"} )
In [3165]:
pop.city.unique()
Out[3165]:
array(['City of Melbourne', 'Carlton', 'Docklands', 'East Melbourne',
       'Kensington', 'Melbourne (CBD)', 'Melbourne (Remainder)',
       'North Melbourne', 'Parkville', 'Port Melbourne', 'South Yarra',
       'Southbank', 'West Melbourne (Industrial)',
       'West Melbourne (Residential)'], dtype=object)
In [3166]:
pop.gender.unique()
Out[3166]:
array(['Female', 'Male', 'Total', 'Not applicable'], dtype=object)
In [3167]:
pop.age.unique()
Out[3167]:
array(['Age 0-4', 'Age 5-9', 'Age 10-14', 'Age 15-19', 'Age 20-24',
       'Age 25-29', 'Age 30-34', 'Age 35-39', 'Age 40-44', 'Age 45-49',
       'Age 50-54', 'Age 55-59', 'Age 60-64', 'Age 65-69', 'Age 70-74',
       'Age 75-79', 'Age 80-84', 'Age 85+', 'Average age',
       'Total population'], dtype=object)
In [3168]:
pop.year.unique()
Out[3168]:
array([2021, 2023, 2026, 2028, 2034, 2037, 2038, 2024, 2025, 2040, 2031,
       2033, 2022, 2032, 2036, 2041, 2027, 2029, 2039, 2035, 2030],
      dtype=int64)
In [3169]:
pop
Out[3169]:
city year gender age value
0 City of Melbourne 2021 Female Age 0-4 1943.0
1 City of Melbourne 2023 Female Age 0-4 2477.0
2 City of Melbourne 2026 Female Age 0-4 3595.0
3 City of Melbourne 2028 Female Age 0-4 4083.0
4 City of Melbourne 2034 Female Age 0-4 5242.0
... ... ... ... ... ...
17047 West Melbourne (Residential) 2029 Not applicable Total population 11673.0
17048 West Melbourne (Residential) 2032 Not applicable Total population 13191.0
17049 West Melbourne (Residential) 2035 Not applicable Total population 14820.0
17050 West Melbourne (Residential) 2039 Not applicable Total population 14794.0
17051 West Melbourne (Residential) 2041 Not applicable Total population 14814.0

17052 rows × 5 columns

About Population Forecast Dataset.

This dataset provides Population forecasts by single year for 2021 to 2041. Prepared by SGS Economics and Planning (Jan-Jun 2022), forecasts are available for the municipality and small areas, as well as by industry and space use type. This dataset has total of 17052 rows which indicates total cases and 5 columns which indicate variables(features). The dataset contains the following variables (features):

  • city: Geographical area (Melbourne LGA or small areas used for the City of Melbourne's CLUE analysis). Small areas mostly correspond to traditional suburb boundaries. This is a categorical variable of type object. This variable takes value from the following list of locations: 'City of Melbourne', 'Carlton', 'Docklands', 'East Melbourne','Kensington', 'Melbourne (CBD)', 'Melbourne (Remainder)','North Melbourne', 'Parkville', 'Port Melbourne', 'South Yarra','Southbank', 'West Melbourne (Industrial)','West Melbourne (Residential)'

  • year: This variables shows the year the job were created. This is a numerical variable of type integer. Year is from 2022 to 2040.

  • gender: This variables shows the gender of the person who is a part of the overall population. Gender is divided into four categories: Female, Male, Total and not applicable. This is a categorical nominal variable with type object

  • age: This variables shows the age of a person who is a part of the population in this study. This is a numerical variable of type object

  • value: This variables shows the population of the area for a given year. This is a numerical variable of type float

Checking and Handling missing values.

For population dataset

In [3170]:
# To check if there is any missing values in any columns 
pop.isnull().any()
Out[3170]:
city      False
year      False
gender    False
age       False
value      True
dtype: bool
In [3171]:
# There are no null values in job dataset. However there is null (missing) values in population dataset in value column. 
# Lets find out how many null values are there in value column
print(pop.value.isnull().sum())


# Write more on why you chose to drop rows 
63
In [3172]:
# pop
In [3173]:
pop = pop.dropna(axis= "rows")
In [3174]:
pop.isnull().any()
Out[3174]:
city      False
year      False
gender    False
age       False
value     False
dtype: bool

All missing values in "value" column of population dataset have been removed.

In [3175]:
# Choose all data where the gender is not applicable
pop = pop.set_index("gender")
other_pop = pop.loc[["Total", "Not applicable"],:]
other_pop                                         # saving total and not applicable data into differnet variable for further use. 
pop = pop.loc[["Female", "Male"],:]               # seperating male and female gender from the whole dataset
pop.reset_index(inplace=True)
In [3176]:
pop       # df where gender = ['Male', 'Female']
Out[3176]:
gender city year age value
0 Female City of Melbourne 2021 Age 0-4 1943.0
1 Female City of Melbourne 2023 Age 0-4 2477.0
2 Female City of Melbourne 2026 Age 0-4 3595.0
3 Female City of Melbourne 2028 Age 0-4 4083.0
4 Female City of Melbourne 2034 Age 0-4 5242.0
... ... ... ... ... ...
11125 Male West Melbourne (Residential) 2024 Average age 31.0
11126 Male West Melbourne (Residential) 2026 Average age 32.0
11127 Male West Melbourne (Residential) 2027 Average age 32.0
11128 Male West Melbourne (Residential) 2037 Average age 33.0
11129 Male West Melbourne (Residential) 2038 Average age 33.0

11130 rows × 5 columns

In [3177]:
other_pop    # df where all values from gender = ['total', 'not applicable']
other_pop.reset_index(drop=True)
other_pop
Out[3177]:
city year age value
gender
Total City of Melbourne 2021 Age 0-4 4051.0
Total City of Melbourne 2021 Age 20-24 33134.0
Total City of Melbourne 2021 Age 30-34 21485.0
Total City of Melbourne 2021 Age 50-54 4939.0
Total City of Melbourne 2021 Age 65-69 3698.0
... ... ... ... ...
Not applicable West Melbourne (Residential) 2029 Total population 11673.0
Not applicable West Melbourne (Residential) 2032 Total population 13191.0
Not applicable West Melbourne (Residential) 2035 Total population 14820.0
Not applicable West Melbourne (Residential) 2039 Total population 14794.0
Not applicable West Melbourne (Residential) 2041 Total population 14814.0

5859 rows × 4 columns

The population dataset now only contains data where the gender is either female or male. Another dataframe has been created to save the data from gender whose value was "Not applicable" and "Total" for futher analysis.

In [3178]:
job.isnull().any()
Out[3178]:
city        False
year        False
category    False
industry    False
value       False
dtype: bool
In [3179]:
job.head(5)
Out[3179]:
city year category industry value
0 City of Melbourne 2023 Jobs by industry Accommodation 10286
1 City of Melbourne 2026 Jobs by industry Accommodation 11631
2 City of Melbourne 2032 Jobs by industry Accommodation 13207
3 City of Melbourne 2034 Jobs by industry Accommodation 13420
4 City of Melbourne 2035 Jobs by industry Accommodation 13529

There are no missing values in job dataset hence we do not need to perfrom any operations to remove null or missing values. However the industry_space_use column has values where the industry is "Total jobs" which is not accurate so we will remove the rows where the column has such values and reserve it in a different variable for later analysis.

In [3180]:
total_job_by_ind_space_use = job.set_index("industry").loc[['Total jobs'], : ]
job = job.set_index("industry").loc[['Accommodation', 'Admin and support services',
       'Agriculture and mining', 'Arts and recreation services',
       'Business services', 'Construction', 'Education and training',
       'Other services', 'Electricity, gas, water and waste services',
       'Finance and insurance', 'Food and beverage services',
       'Health care and social assistance',
       'Information media and telecommunications', 'Manufacturing',
       'Public administration and safety', 'Real estate services',
       'Rental and hiring services', 'Retail trade',
       'Transport, postal and storage', 'Wholesale trade',
       'Accommodation - Commercial', 'Education',
       'Entertainment - Hospitality', 'Entertainment - Other', 'Health',
       'Industrial', 'Office', 'Other - Employment', 'Retail'],:]
job.head(5)
Out[3180]:
city year category value
industry
Accommodation City of Melbourne 2023 Jobs by industry 10286
Accommodation City of Melbourne 2026 Jobs by industry 11631
Accommodation City of Melbourne 2032 Jobs by industry 13207
Accommodation City of Melbourne 2034 Jobs by industry 13420
Accommodation City of Melbourne 2035 Jobs by industry 13529
In [3181]:
total_job_by_ind_space_use.head(5)
Out[3181]:
city year category value
industry
Total jobs City of Melbourne 2022 Jobs by industry 473514
Total jobs City of Melbourne 2023 Jobs by industry 483587
Total jobs City of Melbourne 2026 Jobs by industry 519577
Total jobs City of Melbourne 2027 Jobs by industry 530590
Total jobs City of Melbourne 2030 Jobs by industry 563548

Now we only have job data where each industry belongs to a speceific category and not "Total jobs" category. Howevrer, the data from "total jobs" category is saved as dataframe for futher analysis.

In [3182]:
job= job.reset_index()
In [3183]:
job.shape
Out[3183]:
(8526, 5)
In [3184]:
pop.shape
Out[3184]:
(11130, 5)
Exploratory Data Analysis on Job Dataset.
In [3185]:
# changing data types of the variables if they are not accurate 
job.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8526 entries, 0 to 8525
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   industry  8526 non-null   object
 1   city      8526 non-null   object
 2   year      8526 non-null   int64 
 3   category  8526 non-null   object
 4   value     8526 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 333.2+ KB
In [3186]:
job.head(5)
Out[3186]:
industry city year category value
0 Accommodation City of Melbourne 2023 Jobs by industry 10286
1 Accommodation City of Melbourne 2026 Jobs by industry 11631
2 Accommodation City of Melbourne 2032 Jobs by industry 13207
3 Accommodation City of Melbourne 2034 Jobs by industry 13420
4 Accommodation City of Melbourne 2035 Jobs by industry 13529

Now we will have a look at the outliers an overall distribution of the industries and the jobs created by those industries (value)

In [3187]:
sns.boxplot(x=job.value, y=job.industry, data=job )
plt.xscale('log')
plt.figure(figsize=(100,100))
plt.show()

# job.describe()
No description has been provided for this image
<Figure size 10000x10000 with 0 Axes>
In [3188]:
# HANDLING OUTLIERS WITH PERCENTILE METHOD (WINSORIZATION)
for ind in ind_list:
    data=job.loc[job.industry.isin([ind])]
    upper_limit = data.value.quantile(0.85)  # tried 95 percentile but didnt remove many outliers so 85
    lower_limit = data.value.quantile(0.15)
    
    data[(data.value <= upper_limit) & (data.value >= lower_limit)]
    data_1= data.copy()
    data_1.value = np.where(data_1.value >= upper_limit,upper_limit, np.where(data_1.value <= lower_limit,lower_limit, data_1.value))
    job.loc[job.industry.isin([ind])] = data_1
    
sns.boxplot(x="value", y="industry", data=job )
plt.xscale('log')
plt.figure(figsize=(100,100))
plt.show()
No description has been provided for this image
<Figure size 10000x10000 with 0 Axes>
In [3189]:
# Hisotgram of jobs with 20 bins 
sns.histplot(job.value,bins=20)
plt.xlabel("Value of  Jobs")
plt.figure(figsize=(10,10))
plt.show()
No description has been provided for this image
<Figure size 1000x1000 with 0 Axes>
In [3190]:
# increasing bin size to see any major change 
sns.histplot(job.value, bins=100)
Out[3190]:
<Axes: xlabel='value', ylabel='Count'>
No description has been provided for this image
In [3191]:
# Cumulative Counts and Empirical Cumulatie ditribution Function 
sns.histplot(job.value,stat="percent", cumulative=True, color="lightgray")
plt.show()
No description has been provided for this image
In [3192]:
# data by industry 
# all indsutries name
data = (job
        .loc[:, ["industry", "year", "value","city"]]
        .set_index("industry")
        .sort_values("year")
        .reset_index()

       )  # selecting data based on condition. 
data.head(5)
    
Out[3192]:
industry year value city
0 Office 2021 78886.10 Melbourne (CBD)
1 Health 2021 308.00 Kensington
2 Retail trade 2021 83.00 West Melbourne (Industrial)
3 Admin and support services 2021 2372.85 City of Melbourne
4 Education 2021 370.90 East Melbourne
In [3193]:
# legend and colours  settings
# This is just to show users the legend based on city. 
p = sns.scatterplot(data=data,x="year", y="value", hue="city", legend=True )
sns.move_legend(p, "upper right", bbox_to_anchor=(1.7, 1))

# City of Melbourne and Mebourne CSB will see the sharpest rise in jobs 
No description has been provided for this image
In [3194]:
# axis code for subplot location 
fig, axs = plt.subplots(7, 4)  # sharex="all",sharey="all" removed
fig.suptitle('Scatterplot for No. of Jobs per city', fontsize=16)
fig.set_figwidth(20)
fig.set_figheight(20)


for ax in axs.flat:
    ax.set(xlabel='year', ylabel='value')
    # ax.set_xticklabels([])


# Hide x labels and tick labels for top plots and y ticks for right plots.
for ax in axs.flat:
    ax.label_outer()

# code that goes through names 
# of all industries from ind_list 
# and draws scatterplots
x = 0
while x < 28:
    x = 0
    name = ind_list[x]
    for i in range(7):
        for j in range(4):        
            axs[i,j].set_title(name)
            sns.scatterplot(data=data.loc[data.industry.isin([name])],x="year", y="value", hue="city", legend=False,ax=axs[i,j])
           
            
            # print(i, j ,name)
            x+=1
            name = ind_list[x]
No description has been provided for this image
Insights on job dataset from the visualiztions

The job dataset is extremely right skewed which indicates that there are more lower values than there are higher values in the dataset. This can be seen throught the histogram of the job dataset. The histogram indicates that over 7000 jobs were created where the number of jobs is between 0 to 10000. By increasing the bin size we can see that in the histogram mos

t. The cumulative distribution graph shows that around 90% of the jobs are between the value of 0 and 1000

  1. The scatterplot is drawn for the change in number of job created (value) by year. This data goes from the year 2020 to 2040. The scatterplot is color coded to show the change in particular industry based on city. Each city is shown by different colour.

From scatterplot we see the following :

  • While Public administration and safety industry will see a steady rise in jobs in most cities over the time period, Port Melbourne will see a stepeer rise in jobs from 8000 in 2020 to about 10000 by 2040.

  • For Education industry, many cities will see rapid growth in number of jobs. While most cities see steady growth, sharp rise in seen in Port Melbourne, South Yarra and Docklands where the number of jobs increase from 6000 to 8000, 1000 to 4000 and 1000 to 3000 respectively.

  • A steep rise is seen in the number of jobs in Business Services industry in Melbourne (CBD), Carlton and Docklands where the number increases from 7000 to about 10000 jobs, 1000 to 3000 jobs and 1000 and 5000 jobs respectively.

  • For Admin and support services industry most cities have a steady rise in jobs, while Dockland will see a steep rise in number of jobs from 2030 to 2040 where the number of jobs will increase from 2000 to 10000 jobs .

  • A steady decline of number of Wholesale trade jobs will be seen in the city of Kensington from 300 jobs to 200 jobs while a steady growth will be seen in Docklands from 700 to more than 800. Most other cities do not see any significant change in the number of jobs.

  • We dont see any significant change in the number of jobs for Electricity, gas, water and waste services industry in most cities. Most numebr of jobs in this industry are located in Melbourne(remainder), East Melbourne and Carlton.

  • Transport, postal and storage industry will see a rise in number of jobs in most cities except City of Melbourne where the number of jobs increase to about 800 jobs from 2020 to 2030 followed by a decline from 800 jobs to only 600 jobs from 2030 to 2040.

  • Sharp rise in number of jobs in Retail trade industry can be seenin the cities of North Melbourne, Kensington and Melbourne (Remainder) where the number of jobs increase from 700 to over 1000 jobs, 600 to over 800 jobs and 200 to over 400 jobs respectively from 2020 to 2040.

  • While there is a steady rise in number of Construction jobs in Carlton from 2020 to 2030, a sharp rise can be seen from 1200 jobs to about 1700 jobs from 2030 to 2040. For docklands there is a steep rise from 600 jobs to 1000 jobs from 2020 to 2030 and a steady growth till 2040.

  • A steep rise in the number of Rental and hiring services jobs can be seen from 2030 to 2040 in city of of Melborune from 500 jobs in 2020 to around 1000 jobs by 2040. Same trend can be seen for the city of North Melbourne where the number increase from 1200 jobs to around 2000 jobs from 2020 to 2040.

  • No significant change can be seen with Agriculture and mining jobs except for cities of North Melbourne and West Melbourne( Residential ) where there is a steady rise in number of jobs.

  • Rise in real Estate jobs in Kensington from 2020 to 2040 can be seen where the number of jobs increased from 1500 to around 2000 jobs

  • The number of jobs in Finance and insurance industry will increase in Melborune (Remainder) from 25000 jobs to around 30000 jobs.

  • A sharp decline in Manufacturing jobs can be seen in the city of Kensington from 30000 jobs in 2020 to around 21000 jobs by 2040. Port Melbourne also sees a steady decline of Manufactuing jobs from 2020 to 2040.

  • No significant rise can be seen in Information media and telecommunication jobs except a few additional jbs in Melbourne (Remainder) .

  • Most cities will see a steady rise in number of Food and beverages services jobs with a significant rise in Parkville from 20000 jobs in 2020 to around 30000 jobs by 2040.

  • A steep rise in the number of jobs in Health care and social assitance can be observed from 2020 to 2040 in the cities of Port Melbourne and North melbourne from around 2000 jobs to over 6000 jobs. For the city of West Melbourne(Indsutrial) there was a steady rise in jobs from 2020 to 2035 and a significant rise from 2035 to 2040 with number of jobs increasing from less than 2000 to more than 4000 jobs. Around 2000 jobs were added in the city of Melbourne (Remainder), Southbank and South Yarra from 2020 to 2040.

  • Steady rise can be seen in most cities for Arts and recreation services industry. Around 2000 jobs were added in Carlton f or this industry from 2020 to 2040.

-Other services saw a rise in all cities with a significant rise of about 4000 additional jobs in Southbank area from 2020 to 2040.

  • Entertainment-Hospitality industry saw a rise in most cities with singificant changes in teh city of South Yarra from around 5000 jo bs in 2020 t-round 8000 jobs by 2040.

-Accomodation - Commercial and Accomodation indsutries saw a steady growth in number of jobs in most cities.

  • A rapid growth in number of jobs for Retail indsutry can be seen in cities of Port Melbourne, City of Melbourne and Mebourne (remainder) with number of jobs rising from 1700 to 2500 jobs, 1500 to over 2000 jobs, 500 jobs to around 1000 jobs in the given cities respectively.

  • Office and Entertainment-Other industries have seen a steady rise in number of jobs in most cities.

  • Significant rise in number of jobs for Eduaction and training indsutry can be seen in the city fo Carlton and Parkville where the number increased from 2000 jobs to more than 4000 jobs and around 500 jobs to round 2000 jobs in the given cities respectively.

  • A rapid growth in number of jobs can be seen in Health industry in the cities of Port Melbourne and City of Melbourne where the number increased from around 1000 jobs to more than 4000 jobs and 500 jobs to around 3000 jobs in given cities respectively. Docklands has also seen a rise from just few hundreds of jobs to around 2000 jobs from 2020 to 2040. A few jobs were added in Melborune(remainder) and Carlton.

  • No significant change can be seen in the number of jobs for I d rial industry with a decline of jobs in few cities.

Exploratory Data Analysis on Population Dataset.
In [3195]:
avg_age_df = pop[pop.age.isin(["Average age"])]
avg_age_df
Out[3195]:
gender city year age value
1753 Female City of Melbourne 2023 Average age 33.0
1754 Female City of Melbourne 2028 Average age 33.0
1755 Female City of Melbourne 2029 Average age 33.0
1756 Female City of Melbourne 2030 Average age 33.0
1757 Female City of Melbourne 2031 Average age 33.0
... ... ... ... ... ...
11125 Male West Melbourne (Residential) 2024 Average age 31.0
11126 Male West Melbourne (Residential) 2026 Average age 32.0
11127 Male West Melbourne (Residential) 2027 Average age 32.0
11128 Male West Melbourne (Residential) 2037 Average age 33.0
11129 Male West Melbourne (Residential) 2038 Average age 33.0

546 rows × 5 columns

In [3196]:
pop.age.unique()
Out[3196]:
array(['Age 0-4', 'Age 5-9', 'Age 10-14', 'Age 15-19', 'Age 20-24',
       'Age 25-29', 'Age 30-34', 'Age 35-39', 'Age 40-44', 'Age 45-49',
       'Age 50-54', 'Age 55-59', 'Age 60-64', 'Age 65-69', 'Age 70-74',
       'Age 75-79', 'Age 80-84', 'Age 85+', 'Average age'], dtype=object)
In [3197]:
pop.tail(5)   # before removing averge age rows
Out[3197]:
gender city year age value
11125 Male West Melbourne (Residential) 2024 Average age 31.0
11126 Male West Melbourne (Residential) 2026 Average age 32.0
11127 Male West Melbourne (Residential) 2027 Average age 32.0
11128 Male West Melbourne (Residential) 2037 Average age 33.0
11129 Male West Melbourne (Residential) 2038 Average age 33.0
In [3198]:
# removing the rwos where the age is avergae age
pop = pop.set_index("age")
pop = pop.drop(index=("Average age"))
In [3199]:
pop = pop.reset_index()
In [3200]:
pop.tail()   # after removing average age rows in age col
Out[3200]:
age gender city year value
10579 Age 85+ Male West Melbourne (Residential) 2025 2.0
10580 Age 85+ Male West Melbourne (Residential) 2027 3.0
10581 Age 85+ Male West Melbourne (Residential) 2031 5.0
10582 Age 85+ Male West Melbourne (Residential) 2038 8.0
10583 Age 85+ Male West Melbourne (Residential) 2041 8.0
In [3201]:
# female data 
female_data = pop[pop["gender"]=="Female"]
female_data
Out[3201]:
age gender city year value
0 Age 0-4 Female City of Melbourne 2021 1943.0
1 Age 0-4 Female City of Melbourne 2023 2477.0
2 Age 0-4 Female City of Melbourne 2026 3595.0
3 Age 0-4 Female City of Melbourne 2028 4083.0
4 Age 0-4 Female City of Melbourne 2034 5242.0
... ... ... ... ... ...
5287 Age 85+ Female West Melbourne (Residential) 2027 6.0
5288 Age 85+ Female West Melbourne (Residential) 2028 6.0
5289 Age 85+ Female West Melbourne (Residential) 2032 10.0
5290 Age 85+ Female West Melbourne (Residential) 2034 12.0
5291 Age 85+ Female West Melbourne (Residential) 2039 15.0

5292 rows × 5 columns

In [3202]:
age_list = ['Age 0-4', 'Age 5-9', 'Age 10-14', 'Age 15-19', 'Age 20-24',
       'Age 25-29', 'Age 30-34', 'Age 35-39', 'Age 40-44', 'Age 45-49',
       'Age 50-54', 'Age 55-59', 'Age 60-64', 'Age 65-69', 'Age 70-74',
       'Age 75-79', 'Age 80-84', 'Age 85+']

data = (female_data
       .groupby(["age", "city"])
       .size()
       .rename("counts")
       .reset_index())
data
plt.figure(figsize=(8,8))
sns.barplot(data= data, y="counts", x="age", palette="Paired", hue="city")
plt.xticks(rotation=90)

plt.show()


# From this data we can see that the age demographic that make up most female population in this dataset is age 25-29 making up population of 23164.00
No description has been provided for this image
In [3203]:
pop
Out[3203]:
age gender city year value
0 Age 0-4 Female City of Melbourne 2021 1943.0
1 Age 0-4 Female City of Melbourne 2023 2477.0
2 Age 0-4 Female City of Melbourne 2026 3595.0
3 Age 0-4 Female City of Melbourne 2028 4083.0
4 Age 0-4 Female City of Melbourne 2034 5242.0
... ... ... ... ... ...
10579 Age 85+ Male West Melbourne (Residential) 2025 2.0
10580 Age 85+ Male West Melbourne (Residential) 2027 3.0
10581 Age 85+ Male West Melbourne (Residential) 2031 5.0
10582 Age 85+ Male West Melbourne (Residential) 2038 8.0
10583 Age 85+ Male West Melbourne (Residential) 2041 8.0

10584 rows × 5 columns

In [3204]:
(pop
      .loc[:, ["gender", "value"]]
      .groupby("gender")
      .size())


# The dataset contains equal number of observations for both male and female
Out[3204]:
gender
Female    5292
Male      5292
dtype: int64
In [3205]:
(pop
.loc[:, ['age', 'city', 'gender']]
.groupby(["gender", 'age'])
.size()
.rename("counts")
.reset_index())

# Our dataset contains equal distribution of both male and female in each 
# age group 
Out[3205]:
gender age counts
0 Female Age 0-4 294
1 Female Age 10-14 294
2 Female Age 15-19 294
3 Female Age 20-24 294
4 Female Age 25-29 294
5 Female Age 30-34 294
6 Female Age 35-39 294
7 Female Age 40-44 294
8 Female Age 45-49 294
9 Female Age 5-9 294
10 Female Age 50-54 294
11 Female Age 55-59 294
12 Female Age 60-64 294
13 Female Age 65-69 294
14 Female Age 70-74 294
15 Female Age 75-79 294
16 Female Age 80-84 294
17 Female Age 85+ 294
18 Male Age 0-4 294
19 Male Age 10-14 294
20 Male Age 15-19 294
21 Male Age 20-24 294
22 Male Age 25-29 294
23 Male Age 30-34 294
24 Male Age 35-39 294
25 Male Age 40-44 294
26 Male Age 45-49 294
27 Male Age 5-9 294
28 Male Age 50-54 294
29 Male Age 55-59 294
30 Male Age 60-64 294
31 Male Age 65-69 294
32 Male Age 70-74 294
33 Male Age 75-79 294
34 Male Age 80-84 294
35 Male Age 85+ 294
In [3206]:
(pop
.loc[:, ['gender', 'year', 'value','city']]
.groupby(["city", "year"])
.size()
.rename("population_count")
.reset_index)
Out[3206]:
<bound method Series.reset_index of city                          year
Carlton                       2021    36
                              2022    36
                              2023    36
                              2024    36
                              2025    36
                                      ..
West Melbourne (Residential)  2037    36
                              2038    36
                              2039    36
                              2040    36
                              2041    36
Name: population_count, Length: 294, dtype: int64>
In [3207]:
(pop
.loc[:, ['gender', 'year', 'value','city']]
.groupby(["gender", "year"])
.size()
.rename("population_count")
.reset_index)
Out[3207]:
<bound method Series.reset_index of gender  year
Female  2021    252
        2022    252
        2023    252
        2024    252
        2025    252
        2026    252
        2027    252
        2028    252
        2029    252
        2030    252
        2031    252
        2032    252
        2033    252
        2034    252
        2035    252
        2036    252
        2037    252
        2038    252
        2039    252
        2040    252
        2041    252
Male    2021    252
        2022    252
        2023    252
        2024    252
        2025    252
        2026    252
        2027    252
        2028    252
        2029    252
        2030    252
        2031    252
        2032    252
        2033    252
        2034    252
        2035    252
        2036    252
        2037    252
        2038    252
        2039    252
        2040    252
        2041    252
Name: population_count, dtype: int64>
In [3208]:
sns.boxplot(x=pop.value, y=pop.city, data=pop )
plt.xscale('log')
plt.figure(figsize=(100,100))
plt.show()
No description has been provided for this image
<Figure size 10000x10000 with 0 Axes>
In [3209]:
city_names = list(pop.city.unique())
len(city_names)
city_names
Out[3209]:
['City of Melbourne',
 'Carlton',
 'Docklands',
 'East Melbourne',
 'Kensington',
 'Melbourne (CBD)',
 'Melbourne (Remainder)',
 'North Melbourne',
 'Parkville',
 'Port Melbourne',
 'South Yarra',
 'Southbank',
 'West Melbourne (Industrial)',
 'West Melbourne (Residential)']
In [3210]:
# HANDLING OUTLIERS WITH PERCENTILE METHOD (WINSORIZATION)
for city in city_names:
    data=pop.loc[pop.city.isin([city])]
    upper_limit = data.value.quantile(0.90)  # tried 95 percentile but didnt remove many outliers so 85
    lower_limit = data.value.quantile(0.10)
    
    data[(data.value <= upper_limit) & (data.value >= lower_limit)]
    data_1= data.copy()
    data_1.value = np.where(data_1.value >= upper_limit,upper_limit, np.where(data_1.value <= lower_limit,lower_limit, data_1.value))
    pop.loc[pop.city.isin([city])] = data_1
    
sns.boxplot(x=pop.value, y=pop.city, data=pop )
plt.xscale('log')
plt.figure(figsize=(100,100))
plt.show()
No description has been provided for this image
<Figure size 10000x10000 with 0 Axes>
In [3211]:
pop
Out[3211]:
age gender city year value
0 Age 0-4 Female City of Melbourne 2021 1943.0
1 Age 0-4 Female City of Melbourne 2023 2477.0
2 Age 0-4 Female City of Melbourne 2026 3595.0
3 Age 0-4 Female City of Melbourne 2028 4083.0
4 Age 0-4 Female City of Melbourne 2034 5242.0
... ... ... ... ... ...
10579 Age 85+ Male West Melbourne (Residential) 2025 13.5
10580 Age 85+ Male West Melbourne (Residential) 2027 13.5
10581 Age 85+ Male West Melbourne (Residential) 2031 13.5
10582 Age 85+ Male West Melbourne (Residential) 2038 13.5
10583 Age 85+ Male West Melbourne (Residential) 2041 13.5

10584 rows × 5 columns

In [3212]:
sns.scatterplot(data=pop, x="year", y="value", hue="gender")
plt.title("Scatterplot for population change by year (overall-based on gender) ")
plt.show()
No description has been provided for this image
In [3213]:
plt.figure(figsize=(8,8))
p = sns.scatterplot(data=pop, x="year", y="value", hue="age")
plt.title("Scatterplot for population change by year (overall-based on age group) ")
sns.move_legend(p, "upper right", bbox_to_anchor=(1.3, 1))

plt.show()
No description has been provided for this image
In [3214]:
city_names = list(pop.city.unique())
len(city_names)
city_names
Out[3214]:
['City of Melbourne',
 'Carlton',
 'Docklands',
 'East Melbourne',
 'Kensington',
 'Melbourne (CBD)',
 'Melbourne (Remainder)',
 'North Melbourne',
 'Parkville',
 'Port Melbourne',
 'South Yarra',
 'Southbank',
 'West Melbourne (Industrial)',
 'West Melbourne (Residential)']
In [3215]:
pop
Out[3215]:
age gender city year value
0 Age 0-4 Female City of Melbourne 2021 1943.0
1 Age 0-4 Female City of Melbourne 2023 2477.0
2 Age 0-4 Female City of Melbourne 2026 3595.0
3 Age 0-4 Female City of Melbourne 2028 4083.0
4 Age 0-4 Female City of Melbourne 2034 5242.0
... ... ... ... ... ...
10579 Age 85+ Male West Melbourne (Residential) 2025 13.5
10580 Age 85+ Male West Melbourne (Residential) 2027 13.5
10581 Age 85+ Male West Melbourne (Residential) 2031 13.5
10582 Age 85+ Male West Melbourne (Residential) 2038 13.5
10583 Age 85+ Male West Melbourne (Residential) 2041 13.5

10584 rows × 5 columns

In [3218]:
# axis code for subplot location 
fig, axs = plt.subplots(7, 2)  # sharex="all",sharey="all" removed
fig.suptitle('Scatterplot for No. of Jobs per city', fontsize=16)
fig.set_figwidth(15)
fig.set_figheight(15)


for ax in axs.flat:
    ax.set(xlabel='year', ylabel='value')
    # ax.set_xticklabels([])


# Hide x labels and tick labels for top plots and y ticks for right plots.
for ax in axs.flat:
    ax.label_outer()

# code that goes through names 
# of all industries from ind_list 
# and draws scatterplots
x = 0
while x <= 14:
    x = 0
    name = city_names[x]
    for i in range(7):
        for j in range(2):        
            axs[i,j].set_title(name)
            sns.scatterplot(data=pop.loc[pop.city.isin([name])],x="year", y="value", hue="gender", legend=False,ax=axs[i,j],alpha=0.6)
           
            
            # print(i, j ,name)
            x+=1
            name = city_names[x]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[3218], line 32
     30 # print(i, j ,name)
     31 x+=1
---> 32 name = city_names[x]

IndexError: list index out of range
No description has been provided for this image
In [3225]:
# axis code for subplot location 
fig, axs = plt.subplots(7, 2)  # sharex="all",sharey="all" removed
# fig.suptitle('Scatterplot for No. of Jobs per city')
fig.set_figwidth(15)
fig.set_figheight(40)


for ax in axs.flat:
    ax.set(xlabel='year', ylabel='value')
    # ax.set_xticklabels([])


# Hide x labels and tick labels for top plots and y ticks for right plots.
for ax in axs.flat:
    ax.label_outer()

# code that goes through names 
# of all industries from ind_list 
# and draws scatterplots
x = 0
while x <= 14:
    x = 0
    name = city_names[x]
    for i in range(7):
        for j in range(2):        
            axs[i,j].set_title(name)
            sns.scatterplot(data=pop.loc[pop.city.isin([name])],x="year", y="value", hue="age", legend=False,ax=axs[i,j],alpha=0.6)
           
            
            # print(i, j ,name)
            x+=1
            name = city_names[x]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[3225], line 32
     30 # print(i, j ,name)
     31 x+=1
---> 32 name = city_names[x]

IndexError: list index out of range
No description has been provided for this image
In [3226]:
job.value.shape
Out[3226]:
(8526,)
In [3227]:
pop.value.shape
Out[3227]:
(10584,)
In [3228]:
job.city.unique()
Out[3228]:
array(['City of Melbourne', 'Carlton', 'Docklands', 'East Melbourne',
       'Kensington', 'Melbourne (CBD)', 'Melbourne (Remainder)',
       'North Melbourne', 'Parkville', 'Port Melbourne', 'South Yarra',
       'Southbank', 'West Melbourne (Industrial)',
       'West Melbourne (Residential)'], dtype=object)
In [3313]:
pop.year.unique()
Out[3313]:
array([2021, 2023, 2026, 2028, 2034, 2037, 2038, 2031, 2033, 2040, 2041,
       2029, 2039, 2024, 2030, 2022, 2027, 2035, 2036, 2025, 2032],
      dtype=int64)
In [3229]:
pop.city.unique()
Out[3229]:
array(['City of Melbourne', 'Carlton', 'Docklands', 'East Melbourne',
       'Kensington', 'Melbourne (CBD)', 'Melbourne (Remainder)',
       'North Melbourne', 'Parkville', 'Port Melbourne', 'South Yarra',
       'Southbank', 'West Melbourne (Industrial)',
       'West Melbourne (Residential)'], dtype=object)
In [3282]:
city_names
Out[3282]:
['City of Melbourne',
 'Carlton',
 'Docklands',
 'East Melbourne',
 'Kensington',
 'Melbourne (CBD)',
 'Melbourne (Remainder)',
 'North Melbourne',
 'Parkville',
 'Port Melbourne',
 'South Yarra',
 'Southbank',
 'West Melbourne (Industrial)',
 'West Melbourne (Residential)']
In [3323]:
# model example - using carlton as the city for now
df = pop.loc[pop['city']=='Carlton']
df=df.drop(columns = ["age","gender","city"])
# df
from sklearn.linear_model import LinearRegression
x = df.iloc[:, 0].values.reshape(-1, 1)
y = df.iloc[:, 1].values.reshape(-1, 1)
model = LinearRegression().fit(x, y)
y_pred = model.predict([[2041]])
y_pred
Out[3323]:
array([[847.05675806]])
In [3310]:
a = pop.loc[pop.city.isin(['Carlton'])]
a[a.year==int("2037")].value.mean()
Out[3310]:
789.4583333333334
Machine Learning model to predict population based on city and year
In [3325]:
# main model code 
def main_model():
    print(city_names)
    print()
    city = str(input("Choose from the following list of cities : "))
    print()
    year = str(input("Enter the year you would like the population prediction of: "))
    

    # SGS's data that already exist
    if 2021 <= int(year) <= 2032:
        a = pop.loc[pop.city.isin([city])]
        print("SGS Prediction Value: ", round(a[a.year==int(year)].value.mean(),2))

    # my model prediction
    df = pop.loc[pop.city.isin([city])]
    df=df.drop(columns = ["age","gender","city"])
    # df
    from sklearn.linear_model import LinearRegression
    x = df.iloc[:, 0].values.reshape(-1, 1)
    y = df.iloc[:, 1].values.reshape(-1, 1)
    model = LinearRegression().fit(x, y)
    y_pred = model.predict([[int(year)]])
    print(f"Models's predicted value : {round(y_pred[0][0],2)}")
        
main_model()
['City of Melbourne', 'Carlton', 'Docklands', 'East Melbourne', 'Kensington', 'Melbourne (CBD)', 'Melbourne (Remainder)', 'North Melbourne', 'Parkville', 'Port Melbourne', 'South Yarra', 'Southbank', 'West Melbourne (Industrial)', 'West Melbourne (Residential)']


SGS Prediction Value:  463.67
Models's predicted value : 467.58
Summary

Throough this project we can see the overall distribution of the Jobs and Population from the Forecast data provided by SGS from 2020 to 2041. This findings in this project will be beneficial to the stakeholders which may include city planners, policy makers and entrepreneurs and many other entities. This project can help them understand how mange the resources within a particular area that we saw in this project. This project shows the change in population in different areas and the popular jobs in those area. This project also shows, the changes that will take place in furture in terms of population growth and job growth. The jobs are seperated by industries which would help the stakeholders organize resources to industries that will see a major growth in future and also the necessary infrastructure in those areas to support the population growth as a consequence. This project also utilizes the power of Machine Learning where the model predicts the growth that will take place in the population based on the city and year that are inputted to the model. The model uses present data to train an algorithm that can then effectively find the values for future data.

References

https://pandas.pydata.org/docs/reference/api/pandas.unique.html

https://note.nkmk.me/en/python-pandas-nextract/#:~:text=You%20can%20use%20the%20isnull,a%20missing%20value%20or%20not.&text=isnull()%20is%20an%20alias,and%20both%20are%20used%20interchangeably.

https://www.kaggle.com/code/malanep/finding-and-visualizing-missing-data-in-python

https://libraries.io/pypi/missingno

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

https://stackoverflow.com/questions/44869327/find-index-of-all-rows-with-null-values-in-a-particular-column-in-pandas-datafra

https://matplotlib.org/stable/gallery/statistics/hist.html#sphx-glr-gallery-statistics-hist-py

https://studyopedia.com/numpy/search-numpy-array-for-a-value/#:~:text=In%20a%20Numpy%20array%2C%20we,the%20specific%20element%20is%20found.&text=The%20out%20above%20displays%20%5B2,at%20index%202%20and%205.

https://saturncloud.io/blog/how-to-detect-and-exclude-outliers-in-a-pandas-dataframe/#:~:text=Outliers%20can%20be%20identified%20using,middle%2050%25%20of%20the%20data.